From Data to the p-Adic or Ultrametric Model
Abstract
We model anomaly and change in data by embedding the data in an ultrametric space. Taking our initial data as cross-tabulation counts (or other input data formats), Correspondence Analysis allows us to endow the information space with a Euclidean metric. We then model anomaly or change by an induced ultrametric. The induced ultrametric that we are particularly interested in takes a sequential – e.g. temporal – ordering of the data into account. We apply this work to the flow of narrative expressed in the film script of the Casablanca movie; and to the evolution between 1988 and 2004 of the Colombian social conflict and violence.

arXiv:0809.0492v1 [stat.ML] 2 Sep 2008

1 Modeling of Anomaly or Change: Introduction

The data mining and data analysis challenges addressed are the following. (i) Great masses of data, textual and otherwise, need to be exploited and decisions need to be made. Correspondence Analysis handles multivariate numerical and symbolic data with ease. (ii) Structures and interrelationships evolve in time. (iii) We must consider a complex web of relationships. (iv) We need to address all these issues from data sets and data flows.

Various aspects of how we respond to these challenges will be discussed in this article, complemented by the Appendix. We will look at how this works, using the Casablanca film script, and data from the long Colombian civil strife involving government, guerrillas, paramilitaries and civilians.

2 The Geometry and Topology of Information

We consider Correspondence Analysis and hierarchical clustering as a semantic analysis platform. To illustrate our description, we will take film script, the semi-structured expression of a story. Film script is the starting point of what may become a movie. For McKee [4], film script text is the “sensory surface of a work of art” and reflects the underlying emotion or perception. Our data mining approach models and tracks these underlying aspects in the data.
Our approach to textual data mining has a range of novel elements. The starting point for analysis is frequency of occurrence data, typically the ordered scenes crossed by all words used in the script. If the totality of interrelationships is one facet of semantics, then another is anomaly (or change, novelty, breakpoint) as modeled by a clustering hierarchy. If, therefore, a scene is quite different from immediately previous scenes, then it will be incorporated into the hierarchy at a high level. This novel view of hierarchy will be discussed further in section 2.1 below. We draw on these two vantage points on semantics – viz. totality of interrelationships, and using a hierarchy to express change. See [1] for other work that uses p-adic metric properties, tantamount to ultrametric properties, for the same goal of change detection.

2.1 Modeling Semantics via the Geometry and Topology of Information

Some underlying principles are as follows. We start with the cross-tabulation data, scenes × attributes. Scenes and attributes are embedded in a metric space. This is how we are probing the geometry of information, which is a term and viewpoint used by [13].

Underpinning the display in Figure 3 is a Euclidean embedding. The triangular inequality holds for metrics. An example of a metric is the Euclidean distance, exemplified in Figure 1, where each and every triplet of points satisfies the relationship: d(x, z) ≤ d(x, y) + d(y, z) for distance d. Two other relationships also must hold. These are symmetry and positive definiteness, respectively: d(x, y) = d(y, x), and d(x, y) > 0 if x ≠ y, d(x, y) = 0 if x = y.

Further underlying principles used in Figure 3 are as follows. The axes are the principal axes of momentum. Identical principles are used as in classical mechanics. The scenes are located as weighted averages of all associated attributes; and vice versa. Huyghens’ theorem relates to decomposition of inertia of a cloud of points.
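The embedding principles above can be sketched numerically. The following is a minimal illustration, not the authors' implementation: it assumes the usual formulation of Correspondence Analysis as a singular value decomposition of standardized residuals, and the function name `correspondence_analysis` and the toy scenes × attributes table are hypothetical. The sketch shows that Euclidean distances between scenes in the factor space equal chi-squared distances between scene profiles, and that each scene sits at a weighted average of the attribute coordinates, rescaled by the inverse square root of each axis inertia.

```python
import numpy as np

def correspondence_analysis(counts):
    """Sketch of Correspondence Analysis via SVD (hypothetical helper)."""
    N = np.asarray(counts, dtype=float)
    P = N / N.sum()                       # correspondence matrix (relative frequencies)
    r = P.sum(axis=1)                     # row masses: weight of each scene
    c = P.sum(axis=0)                     # column masses: weight of each attribute
    expected = np.outer(r, c)             # independence model
    S = (P - expected) / np.sqrt(expected)  # standardized residuals
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    keep = sv > 1e-12                     # drop numerically null axes
    U, sv, Vt = U[:, keep], sv[keep], Vt[keep]
    F = U * sv / np.sqrt(r)[:, None]      # scene (row) principal coordinates
    G = Vt.T * sv / np.sqrt(c)[:, None]   # attribute (column) principal coordinates
    return F, G, sv ** 2, r, c            # sv**2 are the principal inertias

# Toy table, invented for illustration: 4 scenes crossed by 3 attributes.
counts = np.array([[10, 2, 1],
                   [8, 3, 2],
                   [1, 9, 4],
                   [2, 3, 12]])
F, G, inertia, r, c = correspondence_analysis(counts)

# Scenes as weighted averages of attribute coordinates (weights = scene profiles),
# rescaled per axis by 1 / sqrt(inertia).
profiles = counts / counts.sum(axis=1, keepdims=True)
F_from_G = profiles @ G / np.sqrt(inertia)

# Total inertia of the cloud, decomposed over the principal axes.
total_inertia = inertia.sum()
```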
This is the basis of Correspondence Analysis.

We come now to a different principle: that of the topology of information. The particular topology used is that of hierarchy. Euclidean embedding provides a very good starting point to look at hierarchical relationships. An innovation in our work is as follows: the hierarchy takes sequence, e.g. timeline, into account. This captures, in a more easily understood way, the notions of novelty, anomaly or change.

Let us take an informal case study to see how this works. Consider the situation of seeking documents based on titles. If the target population has at least one document that is close to the query, then this is (let us assume)
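The idea of a hierarchy that takes sequence into account can be illustrated with a small sketch. The excerpt does not state which agglomeration criterion is used, so this example assumes complete linkage restricted to adjacent segments of the sequence; the function names and the toy data are hypothetical. The hierarchy induces a distance U, where U[i, j] is the level at which items i and j are first merged. With complete linkage on adjacent segments the merge levels never decrease, so U satisfies the ultrametric (strong triangle) inequality d(x, z) ≤ max(d(x, y), d(y, z)).

```python
import numpy as np

def sequence_constrained_complete_link(X):
    """Agglomerate only adjacent segments of the sequence (assumed complete link)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Pairwise Euclidean distances between sequence items.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = [[i] for i in range(n)]    # contiguous segments, kept in sequence order
    U = np.zeros((n, n))                  # induced ultrametric distances
    heights = []                          # merge levels, in order of agglomeration
    while len(clusters) > 1:
        # Complete-link distance between each pair of adjacent segments.
        gaps = [D[np.ix_(a, b)].max() for a, b in zip(clusters, clusters[1:])]
        k = int(np.argmin(gaps))          # closest adjacent pair
        a, b = clusters[k], clusters[k + 1]
        U[np.ix_(a, b)] = gaps[k]
        U[np.ix_(b, a)] = gaps[k]
        heights.append(gaps[k])
        clusters[k:k + 2] = [a + b]       # merged segment stays contiguous
    return U, heights

def is_ultrametric(U, tol=1e-12):
    """Check U[i, k] <= max(U[i, j], U[j, k]) for every triple (i, j, k)."""
    bound = np.maximum(U[:, :, None], U[None, :, :])  # bound[i, j, k]
    return bool(np.all(U[:, None, :] <= bound + tol))

# Toy sequence, invented for illustration: an abrupt change after the third item.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.2, 5.1]])
U, heights = sequence_constrained_complete_link(X)
```

In this toy sequence the fourth item differs sharply from the items immediately before it, so the two segments on either side of that breakpoint are joined only at the last, highest level of the hierarchy.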
Similar resources
, interbasin kinetics and ultrametric diffusion
We discuss the interbasin kinetics approximation for random walk on a complex landscape. We show that for a generic landscape the corresponding model of interbasin kinetics is equivalent to an ultrametric diffusion, generated by an ultrametric pseudodifferential operator on the ultrametric space related to the tree of basins. The simplest example of ultrametric diffusion of this kind is describ...
Modeling Fluid's Dynamics with Master Equations in Ultrametric Spaces Representing the Treelike Structure of Capillary Networks
We present a new conceptual approach for modeling of fluid flows in random porous media based on explicit exploration of the treelike geometry of complex capillary networks. Such patterns can be represented mathematically as ultrametric spaces and the dynamics of fluids by ultrametric diffusion. The images of p-adic fields, extracted from the real multiscale rock samples and from some reference...
Ultrametrics in the genetic code and the genome
Ultrametric approach to the genetic code and the genome is considered and developed. p-Adic degeneracy of the genetic code is pointed out. Ultrametric tree of the codon space is presented. It is shown that codons and amino acids can be treated as p-adic ultrametric networks. Ultrametric modification of the Hamming distance is defined and noted how it can be useful. Ultrametric approach with p-a...
Hierarchical model of the actomyosin molecular motor based on ultrametric diffusion with drift
We discuss the approach to investigation of molecular machines using systems of integro– differential ultrametric (p-adic) reaction–diffusion equations with drift. This approach combines the features of continuous and discrete dynamic models. We apply this model to investigation of actomyosin molecular motor. The introduced system of equations is solved analytically using p-adic wavelet theory....
Localization for free ultrametric quantum particle (arXiv:quant-ph/0508213v1, 29 Aug 2005)
Free evolution for quantum particle in generic ultrametric space is considered. We prove that if mean zero wave packet is localized in some space domain then its evolution remains localized in the same domain. In the present note we consider ultrametric quantum mechanics with real time and ultrametric space. Quantum mechanics with ultrametric (p–adic and adelic) space was considered in many wor...
A p-Adic Model of DNA Sequence and Genetic Code
Using basic properties of p-adic numbers, we consider a simple new approach to describe main aspects of DNA sequence and genetic code. Central role in our investigation plays an ultrametric p-adic information space which basic elements are nucleotides, codons and genes. We show that a 5-adic model is appropriate for DNA sequence. This 5-adic model, combined with 2-adic distance, is also suitabl...